Speed up CI and test adjoint in parallel#438
Conversation
connorjward
left a comment
There was a problem hiding this comment.
also adding @pytest.mark.timeout(300) for all our parallel tests
I recommend setting --timeout 300 --timeout-method thread for this (example).
|
Thanks for the comments @connorjward - I'll take a look in detail again soon and probably have some follow up questions! |
|
Hi @connorjward - "soon" was a lot later than I thought it would be. Would you mind reviewing this?
I've done this. I think the main query I have is whether you can split matrix jobs on a single runner (see original description)? I want to keep the standard and adjoint tests separate, but this leads to using two runners (fine with me) even though each runner has e.g. 8 cores, so you should theoretically be able have the standard tests on 4 cores and the adjoint on the other 4. As far as I can tell, GitHub doesn't allow this? |
Yeah I'm pretty sure that splitting like that doesn't fit with what GitHub allows. I do wonder why they are separate jobs though. You are repeating the same setup and teardown for equivalent configurations. I would advocate for testing regular and adjoint in separate steps, as opposed to separate jobs. |
This is what we currently have, which is what causes testing to be so slow:
So because the slow adjoint test isn't tested alongside the regular tests, it dominates the CI time. This slow test ( Perhaps the solution is to split between regular tests, adjoint tests and then the examples separately. It would keep the diagnostic separation between adjoint and regular tests but speed things up by separating the examples from the non-example tests. Merging the examples together would then allow Thoughts @stephankramer? |
|
Ah sorry, I missed the motivation behind all this.
This does seem more natural to me. Instead of having separate jobs maybe another approach is to run the non-example tests as an earlier step and only do the examples at the end. That way you still get fast feedback if things are breaking. |
|
The latest commit is to just tackle the main problem for CI speed directly, which is the Tohoku tsunami example. From my understanding:
I've reduced the grid to 2x2 for testing. With that, the current testing (pre-PR) format is fine & much faster again. But we can also keep the examples separate (current PR approach) - I'm happy either way. |
|
That all looks sensible now. The split in adjoint/non-adjoint tests used to be necessary because the tape started running immediately when import firedrake_adjoint. Now the only thing that's different is that teardown thing that you saw, which could also be run on "normal" tests (in the way you've changed it). Anyway, splitting in tests/examples/adjoint is also fine. I think we previously discussed and agreed that ideally we wouldn't duplicate the example script code in the test but adding functionality to run example scripts in parallel in CI is probably too much effort (and maintenance) - so let's leave that. One final request though. Adding (large-ish) binary-files is not ideal. Could you change it to a sym-link? I think you can just "git rm", then locally make a (weak) symlink and "git add" it back - alternatively you could do some shutil.copy() in the test with a relative path? Either way, if you remove the mesh.msh file, this is a case where it's good to rewrite history (on the branch!): |
a945a25 to
608f575
Compare
stephankramer
left a comment
There was a problem hiding this comment.
Wonderful, all good afaic!
@stephankramer - I've added a mechanism to test the examples in parallel (commit 1). Since I did that for the adjoint example we wanted to rest, I also then did the same for the normal examples (commit 2 - for future use, and for tidal_array.py which is currently considerably slower than the other tests).
Since I've now added another slow test, I've implemented thetis-run-split-tests (commit 3). It's basically the same as firedrake-run-split-tests, but has a fallback for if you haven't got GNU parallel available. This allows us to run our MPI parallel tests concurrently. Everything works locally, except I need that teardown test guard or otherwise I get an AttributeError when no tape exists. Apologies to add more review work, but it does remove the duplication which is what you wanted. If this is not the right approach (or you want to leave it as is), I can just undo the commits and merge once the checks have complete. |
This seems like a useful contribution. Could this be upstreamed? |
Happy to upstream if we think it's a goal. In Firedrake CI you already guarantee GNU Parallel is installed in the Docker image, and |
I can imagine cases where users may not have it installed. Certainly isn't critical though. |
| from mpi4py import MPI | ||
| comm = MPI.COMM_WORLD | ||
| if comm.rank == 0: | ||
| workdir = tmp_path_factory.mktemp("thetis-example-tidal-array") |
There was a problem hiding this comment.
Should this be named tidal-array if in principle can be extended to other examples?
There was a problem hiding this comment.
Actually why are we special-casing parallel here at all, why not just use tmp_path?
|
See #459. |
|
@connorjward (& @stephankramer) - I will add another PR for
The workaround tests work successfully (I just forgot to exclude the script from linting). Even prior to this PR, we saw this hanging behaviour sporadically when doing 2-core MPI tests consecutively rather than concurrently (e.g. https://github.com/thetisproject/thetis/actions/runs/24978421516 - logs are gone but this was an instance). I also suspect there are a lot of processes that need to be killed on the runners, but I don't have access to them. |
|
Changes for Firedrake are now done in
|
|
@connorjward for the Thetis Do you want me to follow these instructions to merge these changes into |
I am handling this in firedrakeproject/firedrake#5178. Should go through tonight or tomorrow morning. |
|
And for your |
@stephankramer thoughts? It would mean we are not testing exactly what a user is installing on |
I prefer it this way because then you're testing what a user is going to encounter soon. It gives you time to adapt to anything that breaks before it hits users. It also means that we will hear about breakages sooner. |
That makes sense. The difference between I'm fine to switch but will leave it with Stephan to decide. |
|
Yeah, I can see the argument for both and indeed the chances that it makes a difference are pretty small - but I think the advantages of "knowing things will break before the users notice" outweigh "knowing things are broken for users", so agree with @connorjward, let's switch to dev-release. It's good for us to spot if a "fix" in firedrake release (unadvertedly cause it shouldn't) breaks things for us, so we can report and get it fixed before the next patch release. If we notice a bug outside our CI (say user report) due to a firedrake bug which is then fixed in firedrake release, it may be good if we can then immediately add a regression test in Thetis. Finally, :latest does not get rebuild regularly, so our weekly tests would be repeating the same exercise most of the time, and not spot any other "environment" changes. |
4740f06 to
9e84b9c
Compare
|
Re-based into 3 clean commits - as reviewed. I have the back-up branch locally if needed but should be good to go. Should be rebased and merged to keep these three commits separate. |
stephankramer
left a comment
There was a problem hiding this comment.
Excellent, many thanks
|
@connorjward could you take one more look if you're happy? |
connorjward
left a comment
There was a problem hiding this comment.
Some comments but ultimately this is CI: if it works in the way that you want then I would call it done and get on with more interesting work!
| -m parallel[2] thetis-repo/test | ||
| : # Split the parallel tests into multiple mpiexec jobs to utilise all cores | ||
| export FIREDRAKE_RUN_SPLIT_TESTS_TIMEOUT=660s | ||
| export FIREDRAKE_RUN_SPLIT_TESTS_KILL_AFTER=30s |
There was a problem hiding this comment.
You can set this at the top level of the file (example)
|
|
||
|
|
||
| def test_examples(example_file, tmpdir, monkeypatch): | ||
| def test_examples(example_file, tmp_path, monkeypatch, request): |
There was a problem hiding this comment.
We do this sort of thing by importing the file: https://github.com/firedrakeproject/firedrake/blob/main/tests/firedrake/demos/test_demos_run.py#L129
This seems totally fine though.
Depends on #437. Closes #426.
Testing adjoint in (MPI) parallel:
For the MPI parallel adjoint tests, I have used the channel-optimisation example and copied the mesh across, because I couldn’t figure out how to add a subdomain to a
RectangleMesh. We could move this into a channel-optimisation directory if that’s cleaner, or alternatively switch to a simplified headland inversion example with regions defined asConstantcontrols. Either way, this requires a duplicate test. With the current example testing setup, we don’t have a way to select whether tests are run in serial or parallel. This could be updated so that along with each adjoint example, we specify whether it is serial or parallel, that would avoid the duplication.I've noticed that every now and again the Thetis MPI parallel tests hang and can take hours to complete, it might be worth also adding
@pytest.mark.timeout(300)for all our parallel tests. I don't think this is because of Thetis itself but just due to MPI collectives or repeated pytest-xdist collection hanging.Speeding up CI test suite:
It’s not possible to split matrix jobs on a single runner (as far as I can tell). I’ve implemented a matrix strategy that splits between runners, but we could revert to running everything on a single runner if preferred. Another option would be to merge the main and adjoint serial tests, although I prefer keeping them separate for clarity and cleaner outputs. If we stick with this then we need to update the tests that are required to pass for merging.